Clinical Concept Embeddings Learned from Massive Sources of Medical Data

Authors

  • Andrew L. Beam
  • Benjamin Kompa
  • Inbar Fried
  • Nathan P. Palmer
  • Xu Shi
  • Tianxi Cai
  • Isaac S. Kohane
Abstract

Word embeddings have emerged as a popular approach to unsupervised learning of word relationships in machine learning and natural language processing. In this article, we benchmark two of the most popular algorithms, GloVe and word2vec, to assess their suitability for capturing medical relationships in large sources of biomedical data. Leaning on recent theoretical insights, we provide a unified view of these algorithms and demonstrate how different sources of data can be combined to construct the largest ever set of embeddings for 108,477 medical concepts using an insurance claims database of 60 million members, 20 million clinical notes, and 1.7 million full text biomedical journal articles. We evaluate our approach, called cui2vec, on a set of clinically relevant benchmarks and in many instances demonstrate state of the art performance relative to previous results. Finally, we provide a downloadable set of pretrained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.

Address correspondence to: Andrew Beam, Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston MA, 02115 [[email protected]]

Clinical Concept Embeddings Learned from Massive Sources of Medical Data. Beam et al., 2018

Introduction

Word embeddings have become an extremely popular way to represent sparse, high-dimensional data in machine learning and natural language processing (NLP). Modern notions of word embeddings based on neural networks have their roots in the neural language model of Bengio et al. [1], though the idea is closely related to many other approaches, notably latent semantic analysis (LSA) [2] and hyperspace analogue to language (HAL) [3].
Word embeddings are motivated by the observation that traditional representations for words, such as a one-hot encoding, are high dimensional and inefficient, since such an encoding captures none of the similarity or correlation information between words in the source text. The central idea is that a word can be characterized by “the company it keeps” [4]; thus, context words which appear around a given word encode a large amount of information regarding that word’s meaning. Word embeddings model this contextual information by creating a lower-dimensional space such that words that appear in similar contexts will be nearby in this new space.

The embedding approach in word2vec has proven to be wildly popular since its introduction, and embeddings are now standard components in many NLP tasks. The main application has been in “transfer learning”, where embeddings are first learned using extremely large sources of unstructured text (such as web crawls, Wikipedia dumps, etc.). The embeddings are then used in a supervised task as components of a model (e.g. a recurrent neural network) which accepts the sequence of pre-trained embeddings as inputs. It has been shown that transfer learning can work as well for text as it does for image data [6], opening up numerous possibilities to exploit transfer learning in many NLP applications.

Within the context of medical data, recent examples have shown that transfer learning works very well for imaging tasks [7], [8], due in large part to the availability of computer vision models [9]–[11] that were pre-trained on the ImageNet database [12]. It is reasonable to believe that the availability of pre-trained embeddings for medical concepts could catalyze a similar set of advancements for non-imaging machine learning tasks, such as clinical NLP and modeling of the electronic healthcare record (EHR).
The primary goal of this work is to construct and evaluate a comprehensive set of embeddings, which we refer to as cui2vec, using extremely large sources of healthcare data and to make these embeddings available to the larger research community. As a secondary goal, we evaluate the performance of the two primary embedding algorithms, word2vec and GloVe, as well as providing practical advice for learning embeddings using clinical data.

Overview of word2vec and GloVe

word2vec

The original work which introduced word2vec [5] actually contains a collection of models and algorithms, including the continuous bag of words (CBOW) model and the skip-gram model. The CBOW model predicts the probability of the target word given its context defined within a window. The skip-gram model predicts the surrounding context given the target word, and this model will be our focus as it has proven to be much more successful than CBOW.

Specifically, the skip-gram model [5] seeks to construct vector representations of a target word w and a context word c such that the conditional probability is high for pairs that co-occur frequently in the source text. For the remainder of this paper we will use w and c to refer to the target word and context word respectively, and use $\vec{w}$ and $\vec{c}$ to refer to the 1×d dimensional target word and context embeddings. Under the skip-gram model, the conditional probability of observing context word c within a fixed window given the target word w is proportional to the exponentiated dot product of their corresponding vectors, and is given by the softmax function below:

$$p(c \mid w) = \frac{\exp(\vec{w}\,\vec{c}^{\top})}{\sum_{c'} \exp(\vec{w}\,\vec{c'}^{\top})}$$

where the sum in the denominator is over all unique context words in the source corpus. Note that this sum is generally intractable and requires approximations to estimate efficiently.
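To make the softmax concrete, the following is a minimal sketch that evaluates it exactly over a tiny vocabulary. The function name `skipgram_softmax` and the toy vectors are illustrative assumptions and not part of word2vec itself; a real vocabulary is far too large for this exact sum, which is why approximations are needed.

```python
import math

def skipgram_softmax(w_vec, context_vecs):
    """Return p(c | w) for every context vector, using the exact softmax.

    w_vec: target embedding as a list of d floats.
    context_vecs: list of context embeddings, each a list of d floats.
    """
    # Dot product between the target vector and each context vector.
    scores = [sum(wi * ci for wi, ci in zip(w_vec, c_vec)) for c_vec in context_vecs]
    # Subtract the maximum score for numerical stability; the result is unchanged.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# Toy example (hypothetical vectors): the first context is aligned with the
# target, the second points the opposite way, the third is the zero vector.
w = [0.5, -1.0, 0.25]
contexts = [[0.5, -1.0, 0.25], [-0.5, 1.0, -0.25], [0.0, 0.0, 0.0]]
probs = skipgram_softmax(w, contexts)
```

In this toy case the aligned context receives the highest probability and the opposed context the lowest, which is the behavior the skip-gram objective rewards for frequently co-occurring pairs.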
Thus, the vectors $\vec{w}$, $\vec{c}$ directly encode information about how likely word w is to appear in a randomly selected piece of text, given that word c has been observed. A key feature of word2vec is its set of tricks to enable efficient training, such as negative sampling, which approximates the sum in the denominator by randomly sampling k context words which do not appear in the current window. This allows the algorithm to be run with bounded memory requirements and in a parallel fashion, which improves the training speed and enables training on very large corpora [5]. Indeed, the key point of Mikolov et al. was that training a simple and scalable model with more data results in better accuracy than a complex non-linear model on a variety of benchmarks.

GloVe

Global Vectors for Word Representation (GloVe) was introduced shortly after the work of Mikolov et al. and differs in several important ways. GloVe produces word embeddings by fitting a weighted log-linear model to co-occurrence statistics. Given that a target word w and a context word c co-occur y times, GloVe solves the following least-squares optimization problem:

$$\underset{\{\vec{w},\, \vec{c},\, b_w,\, b_c\}}{\operatorname{argmin}} \; f(y) \left( \vec{w}\,\vec{c}^{\top} + b_w + b_c - \log(y) \right)^2$$

where $b_w$, $b_c$ are word and context biases, respectively, and $f(y)$ is a weighting function given by:

$$f(y) = \left( \frac{y}{y_{\max}} \right)^{\alpha} \quad \text{if } y < y_{\max}$$

and is 1 otherwise. The final embedding for word i is the sum of the resulting word and context vectors for that word. This is repeated for all w,c pairs and is trained iteratively using stochastic gradient descent. The most expensive step is the construction of the term-term co-occurrence matrix, which is necessary before training can begin.

Embeddings as a Factorization of a Modified Co-occurrence Matrix

In a previous work [13], Levy and Goldberg showed that the skip-gram model with negative sampling (SGNS), which is often considered to be state of the art [14], is implicitly factorizing a shifted, positive pointwise mutual information (PMI) matrix of word-context pairs. PMI is a measure of association between a word and a context word, and can be computed from the counts of word-context pairs in the corpus, given by:

$$\mathrm{PMI}(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)} \qquad (1)$$

where $p(w, c)$ is the number of times word w and context word c occur in the same context window divided by the total number of word-context pairs, whereas $p(w)$, $p(c)$ are the singleton frequencies of w and c, respectively. If we shift the PMI by some constant $\log(k)$ (where k is the number of negative samples in the original word2vec paper), set all negative entries to 0, and factor the resulting shifted positive pointwise mutual information (SPPMI) matrix, we recover the implicit objective of word2vec’s SGNS model. The element-wise SPPMI transformation is shown below:

$$\mathrm{SPPMI}(w, c) = \max(\mathrm{PMI}(w, c) - \log(k),\, 0)$$

Therefore, one can simply factorize the SPPMI matrix using any factorization method, such as a singular value decomposition (SVD), to obtain a lower-dimensional embedding of the words. This finding is critical as it links word2vec to traditional count-based methods which are based on co-occurrence statistics. GloVe was originally presented in terms of explicit matrix factorization and provides an algorithm to perform this factorization (stochastic gradient descent to minimize sum-of-squared error). Thus, under this unified framework the starting point for both word2vec and GloVe is the construction of a term-term co-occurrence matrix.
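The route from a co-occurrence matrix to word2vec-style embeddings can be sketched in a few lines. This is a minimal illustration that assumes a small, dense, symmetric count matrix and NumPy's full SVD; the function names, the toy counts, and the choice to scale the left singular vectors by the square root of the singular values are assumptions of this sketch, not a description of the authors' code.

```python
import numpy as np

def sppmi_matrix(cooc, k=1.0):
    """Shifted positive PMI of a co-occurrence count matrix.

    cooc: square array of non-negative co-occurrence counts.
    k: shift constant; the matrix is shifted by log(k), so k=1 means no shift.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_wc = cooc / total                              # joint frequencies p(w, c)
    p_w = cooc.sum(axis=1, keepdims=True) / total    # singleton frequencies p(w)
    p_c = cooc.sum(axis=0, keepdims=True) / total    # singleton frequencies p(c)
    expected = p_w * p_c                             # broadcasts to a full matrix
    # Pairs that never co-occur get PMI = -inf, which the max() below maps to 0.
    pmi = np.full_like(p_wc, -np.inf)
    observed = p_wc > 0
    pmi[observed] = np.log(p_wc[observed] / expected[observed])
    return np.maximum(pmi - np.log(k), 0.0)

def svd_embeddings(sppmi, d):
    """Rank-d embeddings from an SVD of the SPPMI matrix (illustrative scaling)."""
    u, s, _ = np.linalg.svd(sppmi)
    return u[:, :d] * np.sqrt(s[:d])

# Hypothetical counts for four concepts forming two strongly associated pairs.
cooc = [[0, 8, 1, 0],
        [8, 0, 0, 1],
        [1, 0, 0, 6],
        [0, 1, 6, 0]]
sppmi = sppmi_matrix(cooc, k=1.0)
emb = svd_embeddings(sppmi, d=2)
```

For a vocabulary of realistic size the matrix would be stored sparsely and factored with a truncated solver rather than a full dense SVD; the dense version is used here only to keep the sketch short.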
This insight is what allows us to use these algorithms on problems which may contain non-textual data sources, as we can materialize a co-occurrence matrix using any data where such co-occurrences can be computed. Then we simply use the GloVe algorithm to directly factor this matrix, or use SVD to factor the SPPMI matrix to create word2vec-style embeddings.

cui2vec Overview

Medical data is multi-modal by nature and comes in many forms, including free text (in medical publications and clinical notes) and billing codes for diagnoses and procedures in the electronic healthcare record (EHR). The cui2vec system works by first mapping all of these concepts into a common concept unique identifier (CUI) space using a thesaurus from the Unified Medical Language System (UMLS). Next, a CUI-CUI co-occurrence matrix is constructed, but the way a co-occurrence is counted depends on the source data. Non-clinical text data (e.g. journal articles) is first preprocessed (see methods) and chunked into fixed-length windows of 10 words, and a co-occurrence is counted as the appearance of a CUI-CUI pair in the same window. For claims data, ICD-9 codes are mapped to UMLS CUIs and a co-occurrence is counted as the number of patients in which two CUIs appear in any 30-day period. Finally, for the clinical notes, we counted a co-occurrence as two CUIs appearing in the same 30-day ‘bin’ in a similar fashion to [15]; see the original publication [16] for more details on this definition. Once a master co-occurrence matrix has been constructed using all sources of data, it can be directly factored by GloVe or transformed into an SPPMI matrix and factored using SVD to create word2vec embeddings.
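The window-based counting rule for text can be sketched as follows. This is only an illustration under stated assumptions: windows are taken to be non-overlapping, each unordered CUI pair is counted at most once per window, and CUIs are recognized by the pattern "C" followed by seven digits. The function name `count_cui_cooccurrences` is hypothetical, and the sample token sequence is the normalized sentence the paper gives in its preprocessing section.

```python
import re
from collections import Counter
from itertools import combinations

# A UMLS CUI is the letter C followed by seven digits.
CUI_PATTERN = re.compile(r"^C\d{7}$")

def count_cui_cooccurrences(tokens, window=10):
    """Count CUI-CUI pairs that fall inside the same fixed-length window.

    tokens: list of normalized tokens (words and CUIs).
    window: number of tokens per non-overlapping chunk.
    """
    counts = Counter()
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        # Unique CUIs in this chunk, sorted so each pair has one canonical order.
        cuis = sorted({t for t in chunk if CUI_PATTERN.match(t)})
        for a, b in combinations(cuis, 2):
            counts[(a, b)] += 1
    return counts

text = ("C0006287 was first described by northway and colleagues in 1967 "
        "as a C0024109 C3263722 in a C0021294 C0678226 C0030054 and C0199470")
tokens = text.split()
counts = count_cui_cooccurrences(tokens, window=10)
```

With these assumptions, the first window contains a single CUI and contributes no pairs, while the second window contains five CUIs and contributes ten pairs. Summing such counters over all documents would give the text portion of the co-occurrence matrix.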
Related Work

There is a long history of machine learning and natural language processing for clinical uses, but for the purposes of this paper we confine our review to papers that directly seek to create low-dimensional representations of clinical concepts, in the spirit of word2vec and GloVe. The first investigations [17]–[19] using word2vec for medical concepts were performed shortly after the original word2vec paper appeared in 2013 and reported mixed results, though De Vine et al. reported state of the art performance with respect to human assessments of concept similarity and relatedness. Liu et al. [20] used embeddings jointly trained on Wikipedia and ICU notes to perform automatic expansion of abbreviations which are common in clinical notes. Lastly, Choi et al. [15] performed the work that is most comparable to this study, which used similar sources of data to create embeddings for UMLS CUIs. Choi et al. used a claims database of 4 million patients and a novel methodology to create a set of clinical embeddings, as well as the notes from Finlayson et al. [16].

The work presented here differs in several important ways. First, we have access to a much larger claims database of 60 million patients and a larger set of 1.7 million full text articles (not just abstracts), which should enable both a much larger and higher quality set of embeddings. Secondly, the embeddings produced by Choi et al. are different for each data source, whereas we map all concepts into a common co-occurrence space to produce a single set of embeddings that can be used on tasks which involve different kinds of clinical data. We also present a new and expanded evaluation methodology that is both more interpretable and, we believe, a more natural way to benchmark sets of clinical embeddings.
Finally, we believe that our approach incorporates many of the best practices with respect to tuning parameters (see methods), which also results in increased performance. In conclusion, this work presents a new set of embeddings for 108,477 medical concepts, the largest ever such collection, which are derived from three sources of clinical data and equal or exceed the existing state of the art on nearly all benchmarks.

Materials and Methods

Data Sources

The data come from three independent sources: an un-identifiable claims database from a nationwide US health insurance plan with 60 million members over the period of 2008-2015, a dataset of concept co-occurrences from 20 million notes at Stanford [16], and an open access collection of 1.7 million full text journal articles obtained from PubMed Central.

Text Normalization and Preprocessing

For text data it is important to first normalize against some standard vocabulary or thesaurus. Word embeddings operate on tokens, and many medical concepts can span multiple tokens. To collapse multiword concepts into a single token we used the Narrative Information Linear Extraction (NILE) [21] system normalized against the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) [22] reference thesaurus. SNOMED-CT IDs were then mapped to concept unique identifiers (CUIs) from the UMLS [23]. The pipeline converts all letters to lowercase, removes punctuation, and replaces all medical concepts with their CUI representation (e.g. ‘bronchopulmonary dysplasia’ with C0006287 and ‘resulting from’ with C0678226). For example, our pipeline would transform the sentence (taken from [24]):

"Bronchopulmonary Dysplasia was first described by Northway and colleagues in 1967 as a lung injury in a preterm infant resulting from oxygen and mechanical ventilation."
into the following normalized representation:

“C0006287 was first described by northway and colleagues in 1967 as a C0024109 C3263722 in a C0021294 C0678226 C0030054 and C0199470”

Benchmarks and Evaluation

The benchmarking strategy leverages previously published “known” relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and use this similarity to assess whether or not the two concepts are related. Cosine similarity (cos) between word vectors $\vec{w}_1$, $\vec{w}_2$ is given by:

$$\cos(\vec{w}_1, \vec{w}_2) = \frac{\vec{w}_1 \vec{w}_2^{\top}}{\|\vec{w}_1\|_2 \, \|\vec{w}_2\|_2}$$

and is 1 if the vectors are identical and 0 if they are orthogonal. One approach would be to rank the cosine similarity for a known relationship against all others via a ranking metric such as mean-precision or discounted cumulative gain. However, such a strategy has several limitations. The primary issue is that many concepts may correctly be ranked higher than the query concept, but are not part of the database of known relationships. Thus, a ranking metric may incorrectly penalize a set of embeddings simply because some true relationships were ranked higher but were not included in the list of “known” relationships.

Instead, we adopt an approach based on the notion of statistical power. For a known relationship pair (x, y), we first compute the null distribution of scores by drawing 10,000 bootstrap samples (x*, y*) where x* and y* belong to the same category as x and y, respectively.
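The cosine similarity and the bootstrap null distribution just described can be sketched as below. This is a minimal illustration under assumptions: the toy embeddings, the concept names, and the function names are hypothetical, sampling is done with replacement (so a sampled pair may contain the same concept twice), and all vectors are assumed to be non-zero.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bootstrap_null(embeddings, category_x, category_y, n_samples=10_000, seed=0):
    """Null distribution of cosine scores for random same-category pairs.

    embeddings: dict mapping concept id -> vector.
    category_x, category_y: lists of concept ids sharing the category of x and of y.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        x_star = rng.choice(category_x)   # random concept from x's category
        y_star = rng.choice(category_y)   # random concept from y's category
        scores.append(cosine_similarity(embeddings[x_star], embeddings[y_star]))
    return scores

# Hypothetical 2-dimensional embeddings for four concepts in one category.
embeddings = {
    "concept_a": [1.0, 0.0],
    "concept_b": [0.8, 0.6],
    "concept_c": [0.0, 1.0],
    "concept_d": [-0.6, 0.8],
}
same_category = list(embeddings)
null_scores = bootstrap_null(embeddings, same_category, same_category, n_samples=1000)
```

The observed score for a known pair is then compared against this list of null scores.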
For example, when assessing whether “preterm infant” (which is a disease or syndrome) is associated with “bronchopulmonary dysplasia” (also a disease or syndrome) we would randomly sample two concepts from the “disease or syndrome” class and compute their cosine similarity, and then repeat this procedure 10,000 times to create the bootstrap distribution. We then compare the observed score between x and y and declare it statistically significant if it is greater than the 95th percentile of the bootstrap distribution (i.e. p < 0.05 for a one-sided test). Applying this procedure to the collection of known relationships, we calculate the statistical power to reject the null of no relationship, which is the quantity we report in all experiments, except for the comparison to human assessments of similarity. This metric has the added benefit of being easy to interpret, as it is an estimate of the fraction of true relationships discovered, given a tolerance for a 5% false positive rate.

Below is a list of the benchmarks used in this study, along with some details that are specific to each:

  • Comorbidity Benchmarks: A comorbidity is a disease or condition that frequently accompanies a primary diagnosis. We have hand curated a set of comorbid conditions for Addison’s disease, autism, heart disease, obesity, schizophrenia, type 1 diabetes and type 2 diabetes. These comorbidities were extracted from the Mayo Clinic’s Encyclopedia of Diseases and Conditions [25], Wikipedia, and the Merck Manuals [26]. Please refer to the supplement for more information regarding these benchmarks.
      o Example Relationship: Primary condition: premature infant (CUI: C0021294); Comorbidity: bronchopulmonary dysplasia (CUI: C0006287)
  • Causative Relationships: The UMLS contains a table (MRREL) of entities known to be the cause of a certain result.
    We extracted known instances of the relationships cause of, causative agent, and induces from the MRREL table. We computed the null distribution for these relationships by computing the similarity of randomly sampled concepts with the same semantic type as the cause and randomly sampled concepts with the same semantic type as the result.
      o Example Relationship: Cause: Jellyfish sting (CUI: C0241955); Result: Irukandji syndrome (CUI: C1655386)
  • National Drug File Reference Terminology (NDF-RT): The NDF-RT was created by the U.S. Department of Veterans Affairs, Veterans Health Administration. We extracted drug-condition relationships using the may prevent and may treat relationships. We assessed power to detect may treat and may prevent relationships using bootstrap scores of random drug-disease pairs.
      o Example Relationship: Drug: abciximab (CUI: C0288672); May Treat: Myocardial Ischemia (CUI: C0151744)
  • UMLS Semantic Type: Semantic types are meta-information about which category a concept belongs to, and these categories are arranged in a hierarchy. We extracted the most specific semantic type available for each concept from the MRSTY file provided by UMLS. To assess power to detect if two concepts belonged to the same semantic type, we randomly sampled concepts from different semantic type classes and computed a marginal null distribution of scores.
      o Example Relationship: Concept: Metronidazole (CUI: C0025872, Semantic Type: Pharmacologic Substance); Concept: Clofazimine (CUI: C0008996, Semantic Type: Pharmacologic Substance)
  • Human Assessment of Concept Similarity: Previous work [27] has assessed how resident physicians perceive the relationships among 566 pairs of UMLS concepts. Each concept pair has an average measure of how similar or related the two concepts are, as judged by resident physicians.
    We report the Spearman correlation between the human assessment scores and cosine similarity from the embeddings as the primary endpoint for this benchmark.

Implementation Details

There are many hyper-parameters associated with both word2vec and GloVe that can have a dramatic effect on performance. In word2vec, parameters such as the number of negative samples, the size of the context window, the amount of smoothing for the context singleton frequencies, and whether or not the context vectors are used to construct the final embeddings are all options that the practitioner must choose. Levy and Goldberg [28] conducted a systematic set of experiments on the effects of these hyper-parameters on the performance of word2vec, and we follow their recommendations in this work. Specifically, we used the following settings for all word2vec experiments that are based on a singular value decomposition (SVD):

  • Smoothing of singleton frequencies by a constant exponential term. Instead of using $p(w)$ in (1), we use $p(w)^{\alpha}$, where $\alpha$ is set to 0.75. Levy and Goldberg recommend only smoothing the context singleton frequencies, but our co-occurrence matrices are symmetric, so there is no difference in the singleton frequency when it is a “word” and when it is a “context”.
  • The SVD of the sparse SPPMI matrix was performed using the augmented implicitly restarted Lanczos bidiagonalization algorithm [29] with the irlba package [30] in the R programming language.
  • We set $k = 1$ in the SPPMI transformation (i.e. no shift).
  • We construct the final embeddings using a symmetrically scaled sum of the word and context vectors resulting from the singular value decomposition.
    Given the first d singular vectors and singular values resulting from the SVD of an SPPMI matrix X, $\mathrm{SVD}_d(X) = U_d \Sigma_d V_d^{\top}$, the set of d-dimensional word vectors W is constructed as follows:

$$\tilde{W} = U_d \sqrt{\Sigma_d}, \qquad \tilde{C} = V_d \sqrt{\Sigma_d}, \qquad W = \tilde{W} + \tilde{C}$$

For the comparison to the traditional word2vec algorithm on the articles from PubMed, we used the implementation available in the gensim Python package [31]. We used the skip-gram algorithm, hierarchical softmax, 10 negative samples, and a window size of 10.

We used the implementation of GloVe available in the R package text2vec [32]. We used the sum of target word and context vectors as the final embedding and set $y_{\max} = 100$. As a baseline we also include a principal component analysis (PCA) of the unmodified co-occurrence matrix, where we took the top-d left singular vectors (i.e. $U_d$) as the embeddings. This was also achieved using the irlba package.

Results

Benchmark Results

In total we were able to estimate embeddings for 108,477 unique concepts, making this the largest such collection of embeddings of medical concepts. Figure 1 shows a visualization of the various intersections of the 108,477 concepts found across the different sources of data using the UpSet visualization method [33], [34]. Most of the concepts appear in only one corpus; however, 16,299 (14%) appeared in multiple sources. Figure 2 contains a 2-dimensional t-SNE visualization of the final cui2vec embeddings, colored by the source data.

Next, we compared embeddings created by GloVe, word2vec, and PCA on our suite of benchmarks to determine which algorithm and dimensionality produced the best results. These results are shown in Figure 3. The best configuration was word2vec with an embedding dimension of 500, as it achieved the highest performance across all benchmarks.
Interestingly, we saw only a modest effect of embedding dimension on the benchmarks based on power (Figure 3, Panel A), though there was a noticeable effect on the similarity and relatedness tests (Figure 3, Panel C). We also provide each algorithm’s performance across the three primary sources of data (insurance claims, clinical notes, and PubMed articles) in the supplement. Of note, the most direct comparison we could make to the original word2vec algorithm was using PubMed articles. On this dataset, word2vec based on an SVD was better than the original algorithm (Supplemental Figure 2, Panel A/B). In most tests, word2vec outperformed both GloVe and PCA, so we selected 500-dimensional word2vec-style embeddings as our final embedding set. These embeddings are referred to as cui2vec in all subsequent experiments.

Comparison to Previous Results

We evaluated previously published embeddings obtained through the clinicalml GitHub repository (https://github.com/clinicalml/embeddings) for comparison to our cui2vec embeddings. Note that all three of the comparison embeddings come from different data sources and have very few concepts in common, so we were forced to make pairwise comparisons between cui2vec and each set of embeddings. The first comparison was against 300-dimensional embeddings for 15,905 concepts (of which 12,568 were in common with cui2vec) derived from a claims database of 4 million patients. The results are shown in Figure 4, Panel A and Figure 5, Panel A. We observed that cui2vec outperformed the reference embeddings in most tasks, in some instances by a substantial margin, though the embeddings from Choi et al. had the edge in the human assessment benchmark. Next, we compared against 300-dimensional embeddings for 28,394 concepts derived from the same set of clinical notes in [16], published as part of [15].
In total, there were 21,789 concepts in common between cui2vec and this set of embeddings. Here cui2vec was again better in most benchmarks, in some cases by a large margin (Figure 4, Panel B and Figure 5, Panel B). Finally, we compared cui2vec against 200-dimensional embeddings for 59,266 concepts derived from 348,566 PubMed abstracts, first published in [18]. There were 33,376 concepts in common that were used for benchmarking. On this dataset we observed a large relative improvement, and cui2vec was uniformly better across all benchmarks (Figure 4, Panel C and Figure 5, Panel C).

Discussion

In this study we have created the most comprehensive set of clinical embeddings to date, covering 108,477 concepts, using extremely large and multi-modal sources of medical data. When compared to previous results, the cui2vec embeddings achieve state of the art performance in many instances. Even though there is more healthcare data than ever, most of it is either unlabeled or weakly labeled, so the ability to extract meaningful structure in an unsupervised manner is extremely important. Another potential obstacle is that most sources of healthcare data are not easily shareable, which limits some researchers to small sources of local data. We hope to reduce both of these barriers by providing our cui2vec embeddings, which were created using large and national sources of healthcare data.
We believe that these embeddings will be generally useful for a variety of clinically oriented machine learning tasks and have made them available at https://figshare.com/s/00d69861786cd0156d81

Availability of Data

The pre-trained embeddings can be downloaded at: https://figshare.com/s/00d69861786cd0156d81

An interactive tool to explore the embeddings can be found at: cui2vec.dbmi.hms.harvard.edu

Acknowledgements

The authors wish to thank Griffin Weber, MD/PhD for his help constructing the co-occurrence tables from the claims database and Dan Traviglia for his assistance in setting up the interactive tool. Additionally, the authors wish to thank Brett Beaulieu-Jones, PhD and Sam Finlayson, MS for their helpful feedback and comments on a draft of this manuscript.

References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A Neural Probabilistic Language Model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[2] M. W. Berry, S. T. Dumais, and G. W. O’Brien, “Using Linear Algebra for Intelligent Information Retrieval,” SIAM Rev., vol. 37, no. 4, pp. 573–595, 1995.
[3] K. Lund and C. Burgess, “Producing high-dimensional semantic spaces from lexical co-occurrence,” Behav. Res. Methods, Instruments, Comput., vol. 28, no. 2, pp. 203–208, 1996.
[4] Z. S. Harris, “Distributional Structure,” WORD, vol. 10, no. 2–3, pp. 146–162, 1954.
[5] T. Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality,” in NIPS’14, 2013, vol. cs.CL, pp. 3111–3119.
[6] J. Howard and S. Ruder, “Fine-tuned Language Models for Text Classification,” arXiv Prepr. arXiv1801.06146, 2018.
[7] V. Gulshan et al., “Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs,” JAMA, vol. 304, no. 6, pp. 649–656, 2016.
[8] A. L. Beam and I. S. Kohane, “Translating Artificial Intelligence Into Clinical Care,” JAMA, vol. 346, no. 8973, pp. 456–7, 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv Prepr. arXiv1512.03385, 2015.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, and J. Shlens, “Rethinking the inception architecture for computer vision,” arXiv Prepr. arXiv, 2015.
[11] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Int. Conf. Learn. Represent., pp. 1–14, 2015.
[12] J. Deng, W. Dong, R. Socher, L. Li, and K. Li, “Imagenet: A large-scale hierarchical image database,” Comput. Vis., 2009.
[13] O. Levy and Y. Goldberg, “Neural Word Embedding as Implicit Matrix Factorization,” Adv. Neural Inf. Process. Syst., pp. 2177–2185, 2014.
[14] M. Baroni, G. Dinu, and G. Kruszewski, “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 238–247.
[15] Y. Choi, C. Y.-I. Chiu, and D. Sontag, “Learning Low-Dimensional Representations of Medical Concepts,” AMIA, pp. 373–374, 2016.
[16] S. G. Finlayson, P. LePendu, and N. H. Shah, “Building the graph of medicine from millions of clinical narratives,” Sci. data, vol. 1, p. 140032, 2014.
[17] J. A. Minarro-Gimenez, O. Marin-Alonso, and M. Samwald, “Exploring the Application of Deep Learning Techniques on Medical Text Corpora,” in Studies in Health Technology and Informatics, 2014, vol. 205, pp. 584–588.
[18] L. De Vine, G. Zuccon, B. Koopman, L. Sitbon, and P. Bruza, “Medical Semantic Similarity with a Neural Language Model,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management CIKM ’14, 2014, pp. 1819–1822.
[19] S. Moen and T. S. S. Ananiadou, “Distributional semantics resources for biomedical text processing,” LBM, 2013.
[20] Y. Liu, T. Ge, K. Mathews, H. Ji, and D. McGuinness, “Exploiting task-oriented resources to learn word embeddings for clinical abbreviation expansion,” Proc. BioNLP 15, pp. 92–97, 2015.
[21] S. Yu and T. Cai, “A short introduction to NILE,” arXiv Prepr. arXiv1311.6063, 2013.
[22] K. Donnelly, “SNOMED-CT: The advanced terminology and coding system for eHealth,” Stud. Health Technol. Inform., vol. 121, p. 279, 2006.
[23] O. Bodenreider, “The unified medical language system (UMLS): integrating biomedical terminology,” Nucleic Acids Res., vol. 32, no. suppl_1, pp. D267–D270, 2004.
[24] A. H. Jobe and E. Bancalari, “Bronchopulmonary dysplasia,” in American Journal of Respiratory and Critical Care Medicine, 2001, vol. 163, no. 7, pp. 1723–1729.
[25] M. C. Staff, “Mayo Clinic: Diseases and Conditions.” [Online]. Available: https://www.mayoclinic.org/diseases-conditions. [Accessed: 01-Jun-2016].
[26] R. Lowell, “The Merck Manual,” Anesth. Prog., vol. 25, no. 3, p. 100, 1978.
[27] S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G. B. Melton, “Semantic similarity and relatedness between clinical terms: an experimental study,” in AMIA annual symposium proceedings, 2010, vol. 2010, p. 572.
[28] O. Levy, Y. Goldberg, and I. Dagan, “Improving Distributional Similarity with Lessons Learned from Word Embeddings,” Trans. Assoc. Comput. Linguist., vol. 3, pp. 211–225, 2015.
[29] J. Baglama and L. Reichel, “Augmented Implicitly Restarted Lanczos Bidiagonalization Methods,” SIAM Journal on Scientific Computing, vol. 27, no. 1, pp. 19–42, 2005.
[30] J. Baglama, L. Reichel, and B. W. Lewis, “irlba: Fast Truncated Singular Value Decomposition and Principal Components Analysis for Large Dense and Sparse Matrices,” 2017.
[31] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[32] D. Selivanov, “text2vec: Modern Text Mining Framework for R,” 2016.
[33] A. Lex, N. Gehlenborg, H. Strobelt, R. Vuillemot, and H. Pfister, “UpSet: Visualization of intersecting sets,” IEEE Trans. Vis. Comput. Graph., vol. 20, no. 12, pp. 1983–1992, 2014.
[34] J. R. Conway, A. Lex, and N. Gehlenborg, “UpSetR: An R package for the visualization of intersecting sets and their properties,” Bioinformatics, vol. 33, no. 18, pp. 2938–2940, 2017.

Figures and Tables

Figure 1: Intersection of medical concepts found in the insurance claims, clinical notes, and biomedical journal articles (PMC).

Similar resources

Learning Low-Dimensional Representations of Medical Concepts

We show how to learn low-dimensional representations (embeddings) of a wide range of concepts in medicine, including diseases (e.g., ICD9 codes), medications, procedures, and laboratory tests. We expect that these embeddings will be useful across medical informatics for tasks such as cohort selection and patient summarization. These embeddings are learned using a technique called neural languag...

Topic-Based Embeddings for Learning from Large Knowledge Graphs

We present a scalable probabilistic framework for learning from multi-relational data, given in form of entity-relation-entity triplets, with a potentially massive number of entities and relations (e.g., in multirelational networks, knowledge bases, etc.). We define each triplet via a relation-specific bilinear function of the embeddings of entities associated with it (these embeddings correspo...

Learning Effective Embeddings from Medical Notes

With the large amount of available data and the variety of features they offer, electronic health records (EHR) have gotten a lot of interest over recent years, and start to be widely used by the machine learning and bioinformatics communities. While typical numerical fields such as demographics, vitals, lab measurements, diagnoses and procedures, are natural to use in machine learning models, ...

Improving Implicit Discourse Relation Recognition with Discourse-specific Word Embeddings

We introduce a simple and effective method to learn discourse-specific word embeddings (DSWE) for implicit discourse relation recognition. Specifically, DSWE is learned by performing connective classification on massive explicit discourse data, and capable of capturing discourse relationships between words. On the PDTB data set, using DSWE as features achieves significant improvements over base...

Hybed: Hyperbolic Neural Graph Embedding

Neural embeddings have been used with great success in Natural Language Processing (NLP). They provide compact representations that encapsulate word similarity and attain state-of-the-art performance in a range of linguistic tasks. The success of neural embeddings has prompted significant amounts of research into applications in domains other than language. One such domain is graph-structured d...

Publication date: 2018